McGill University
University of Paris Saclay
30 Jul, 2025
Asymptotic convergence of stochastic approximation \[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]
Find sufficient conditions under which \(θ_t \to θ^*\)
Since \(\{θ_t\}_{t \ge 0}\) is a stochastic process, we have different notions of convergence: almost sure, mean squared, in probability
Write SA iteration as \[ \frac{θ_{t+1} - θ_t}{α_t} = f(θ_t) + ξ_{t+1} \]
Iterates \(\{θ_t\}_{t \ge 0}\) may be viewed as noisy Euler discretization of the ODE \[ \dot θ = f(θ) \]
So, as time goes to infinity (i.e., \(\sum α_t = ∞\)), the iterates \(\{θ_t\}_{t \ge 0}\) should converge to the equilibrium point of the ODE
Provided the noise is not too large (which roughly translates to \(\sum α_t^2 < ∞\))
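To make this heuristic concrete, a standard sketch (under the martingale-difference and controlled-variance assumptions stated later) telescopes the iteration:
\[ θ_{T} = θ_0 + \sum_{t=0}^{T-1} α_t f(θ_t) + M_T, \qquad M_T \coloneq \sum_{t=0}^{T-1} α_t ξ_{t+1}. \]
Since \(M_T\) is a martingale with orthogonal increments, \(\EXP[\NORM{M_T}_2^2] \le \sum_{t=0}^{T-1} α_t^2 \, \EXP[\NORM{ξ_{t+1}}_2^2]\), so \(\sum α_t^2 < ∞\) makes \(M_T\) bounded in \(L^2\) and hence convergent a.s.: the accumulated noise stays finite even though the accumulated time \(\sum α_t\) is infinite.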
(Robbins and Monro 1951) Proposed SA and proved convergence in probability. Much follow-up work used the probabilistic approach
(Ljung 1977) ODE approach to SA (automatic classification of signals, recursive identification algorithms, self-tuning control)
Further developed in (Kushner and Clark 1978)
(Benaïm 1996) weak notion of recurrence (chain recurrence for slightly perturbed orbits and its relation with the notion of attractor)
(Benaïm and Hirsch 1999) studying compact limit sets of trajectories for a class of non-autonomous systems (various notions, including asymptotic pseudotrajectory)
Many others. See (Benveniste et al. 1990; Kushner and Yin 2003; Borkar 2008; Prashanth and Bhatnagar 2025)
Consider a few examples of \(f(θ)\) where \(ξ_{t+1} \sim \text{Unif}[-1,1]\) and \(α_t = C/(t+1)\).
In all cases, the ODE \(\dot θ = f(θ)\) is globally asymptotically stable!
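As a minimal sketch (illustrative, not from the source; the constants are assumptions), the iteration can be simulated directly. For instance, the linear drift \(f(θ) = -0.5θ\) considered later, with \(ξ_{t+1} \sim \text{Unif}[-1,1]\) and \(α_t = C/(t+1)\), settles near \(θ^* = 0\):

```python
import random

random.seed(0)

def sa(f, theta0, C=1.0, T=100_000):
    """Run θ_{t+1} = θ_t + α_t [f(θ_t) + ξ_{t+1}] with α_t = C/(t+1), ξ ~ Unif[-1, 1]."""
    theta = theta0
    for t in range(T):
        alpha = C / (t + 1)
        theta += alpha * (f(theta) + random.uniform(-1, 1))
    return theta

print(sa(lambda th: -0.5 * th, theta0=5.0))  # settles near θ^* = 0
```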
Explain sufficient conditions for convergence of stochastic approximation
Try to understand what is happening in the examples
Focus on almost sure convergence. Other forms of convergence have different sufficient conditions
Do not present a detailed literature review but only a few key references
Stochastic approximation. Given a filtration \(\{\ALPHABET F_t\}_{t \ge 0}\), adapted sequences \(\{α_t\}_{t \ge 0}\) and \(\{ξ_t\}_{t \ge 1}\), consider the iteration:
\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]
F1. There is a unique \(θ^* \in \reals^d\) such that \(f(θ^*) = 0\).
F2. The function \(f\) is globally Lipschitz continuous with constant \(L\). \[ \NORM{f(θ_1) - f(θ_2)}_2 \le L \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]
\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]
N1. Martingale difference noise. The noise is a martingale difference sequence, i.e., \[ \EXP_t[ ξ_{t+1} ] = 0, \hskip 0.5em a.s. \]
N2. Controlled growth of variance. There exists a \(σ^2\) such that \[ \EXP_t[ \NORM{ξ_{t+1}}^2_2 ] \le σ^2(1 + \NORM{θ_t - θ^*}_2^2), \hskip 0.5em a.s. \]
R1. \(\sum_{t \ge 0} α_t^2 < ∞\)
R2. \(\sum_{t \ge 0} α_t = ∞\)
Robbins–Monro conditions
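As a worked check, the step sizes \(α_t = C/(t+1)\) used in the earlier examples satisfy both conditions:
\[ \sum_{t \ge 0} \frac{C}{t+1} = ∞ \ \ \text{(harmonic series)}, \qquad \sum_{t \ge 0} \frac{C^2}{(t+1)^2} = \frac{C^2 π^2}{6} < ∞. \]
More generally, \(α_t = C/(t+1)^p\) satisfies (R1)–(R2) for any \(p \in (1/2, 1]\).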
V0. \(θ^*\) is the globally asymptotically stable equilibrium point of the ODE \(\dot θ = f(θ)\).
Assume that the modeling assumptions hold. Then (V0) implies that \(θ_t \to θ^*\) a.s. on the set \(\{ ω : \sup_{t \ge 0} \NORM{θ_t(ω)}_2 < ∞ \}\).
The result does not imply boundedness of iterates!
It only shows that along the sample paths where iterates remain bounded, they converge to the right limit.
(Borkar and Meyn 2000) fluid limit model of ODE and global asymptotic stability of limiting ODE
(Vidyasagar 2023) Sufficient conditions based on the properties of the Lyapunov function
BM There exists a function \(f_{∞} \colon \reals^d \to \reals^d\) such that \[ \lim_{r \to ∞} \frac{f(r θ)}{r} = f_{∞}(θ), \quad \forall θ \in \reals^d \] and the origin is the g.a.s. equilibrium of the fluid limit ODE \(\dot θ = f_{∞}(θ)\).
Assume that the modeling assumptions hold. Then (BM) implies that the iterates \(\{θ_t\}_{t \ge 0}\) are bounded a.s. and \(θ_t \to θ^*\) a.s.
| \(f(θ)\) | (BM) | Converges a.s. |
|---|---|---|
| \(-0.5 θ\) | Yes | Yes |
| \(-θ/(1 + θ^2)\) | No | Yes |
| \(-θ\sqrt{\ABS{θ}}\) | No | No |
| \(-θ^3\) | No | No |
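A quick numerical probe of the (BM) condition (illustrative, not from the source): evaluate \(f(rθ)/r\) at increasing \(r\); the ratio stabilizes only when the fluid limit exists.

```python
def ratio(f, theta, r):
    """Evaluate f(r·θ)/r; if the r → ∞ limit exists, it equals f_∞(θ)."""
    return f(r * theta) / r

# Candidate scalar drifts (θ^* = 0 in each case)
fs = {
    "-0.5θ": lambda th: -0.5 * th,
    "-θ/(1+θ²)": lambda th: -th / (1 + th ** 2),
    "-θ√|θ|": lambda th: -th * abs(th) ** 0.5,
    "-θ³": lambda th: -th ** 3,
}

for name, f in fs.items():
    # If these values stabilize as r grows, the (BM) limit exists at θ = 1
    print(name, [ratio(f, 1.0, 10.0 ** k) for k in (2, 4, 6)])
```

Only the linear drift yields a nonzero stable limit (\(f_∞(θ) = -0.5θ\), a g.a.s. fluid ODE); \(-θ/(1+θ^2)\) yields \(f_∞ \equiv 0\) (not g.a.s.), and the remaining two ratios diverge.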
There exists a twice-differentiable Lyapunov fn \(V \colon \reals^d \to \reals\) that satisfies
V1. Squared norm equivalent. There exist \(a,b > 0\) such that \[ a \NORM{θ - θ^*}_2^2 \le V(θ) \le b \NORM{θ - θ^*}_2^2, \quad \forall θ \in \reals^d. \]
V2. Smoothness. There exists \(M > 0\) such that \[ \NORM{\GRAD V(θ_1) - \GRAD V(θ_2) }_2 \le 2M \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]
V3. \(\dot V(θ) \coloneq \IP{ \GRAD V(θ)}{f(θ)} \le 0\) for all \(θ \in \reals^d\).
V4. There exists a function \(φ\) in class \(\mathcal{B}\) (i.e., \(φ(0) = 0\) and for all \(0 < ε < M < ∞\), \(\inf_{r \in (ε, M)} φ(r) > 0\)) such that \[ \dot V(θ) \le - φ(\NORM{θ - θ^*}_2), \quad \forall θ \in \reals^d. \]
Assume that the modeling assumptions and (V1)–(V2) hold.
Then (V3) and (R1) imply that the iterates \(\{θ_t\}_{t \ge 1}\) are bounded a.s.
In addition, if (V4) and (R2) hold, then \(θ_t \to θ^*\) a.s.
(V1)–(V3) imply local asymptotic stability. Adding (V4) implies global asymptotic stability.
For boundedness of iterates, we only need local asymptotic stability (and do not require \(θ^*\) to be the unique root of \(f\))
But for almost sure convergence, we need global asymptotic stability.
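A worked illustration (added here, not from the source): for the scalar drift \(f(θ) = -θ/(1+θ^2)\) with \(θ^* = 0\), the candidate \(V(θ) = θ^2\) satisfies all four conditions:
\[ V(θ) = θ^2 \ (\text{V1 with } a = b = 1), \qquad \GRAD V(θ) = 2θ \ (\text{V2 with } M = 1), \]
\[ \dot V(θ) = 2θ f(θ) = -\frac{2θ^2}{1+θ^2} \le -φ(\ABS{θ}), \qquad φ(r) \coloneq \frac{2r^2}{1+r^2}, \]
and \(φ\) is in class \(\mathcal{B}\): \(φ(0) = 0\) and, since \(φ\) is increasing, \(\inf_{r \in (ε,M)} φ(r) = φ(ε) > 0\). Note that \(φ\) is bounded, so a stronger class-\(\mathcal{K}_∞\) requirement would fail here; class \(\mathcal{B}\) is exactly the weaker notion needed.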
| \(f(θ)\) | (BM) | (V1)–(V4) | Converges a.s. |
|---|---|---|---|
| \(-0.5 θ\) | Yes | Yes | Yes |
| \(-θ/(1 + θ^2)\) | No | Yes | Yes |
| \(-θ\sqrt{\ABS{θ}}\) | No | Yes | No |
| \(-θ^3\) | No | Yes | No |
Simple illustration of almost sure convergence of stochastic approximation
Variations encountered in applications (especially RL)
Biased noise: A similar analysis goes through, provided the iterates remain bounded and the accumulated bias is summable, i.e., \(\sum_{t \ge 0} α_t \NORM{ \EXP_t[ ξ_{t+1} ] }_2 < ∞\) a.s.
Markov noise: \(ξ_{t+1} = F(θ_t, w_t) - \EXP_t[ F(θ_t, w_t) ]\), where \(\{w_t\}_{t \ge 0}\) is a Markov process that is controlled by \(θ_t\).
Asynchronous updates: Only one component of \(θ_t\) is updated at each time.
Multi-time scale updates: Multiple coupled SA updates, operating at different timescales.
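A toy two-timescale sketch (entirely illustrative; the coupling and step sizes are assumptions, not from the source): the fast iterate tracks the slow one, and the slow iterate's limiting ODE is \(\dot y = -y\), so both settle near the joint equilibrium \((0, 0)\).

```python
import random

random.seed(0)
x, y = 5.0, 5.0  # fast iterate x, slow iterate y
for t in range(200_000):
    a = (t + 1) ** -0.6  # fast step size (satisfies R1-R2)
    b = (t + 1) ** -1.0  # slow step size; b/a -> 0 separates the timescales
    # Fast update: for frozen y, x tracks the root of g(x) = y - x, i.e. x ≈ y.
    x += a * (y - x + random.uniform(-1, 1))
    # Slow update: sees the quasi-static fast iterate; limiting ODE is ẏ = -y.
    y += b * (-x + random.uniform(-1, 1))
print(x, y)  # both settle near the equilibrium (0, 0)
```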
Quite a rich area. There is a lot of interest in relaxing the regularity assumptions on \(f\) and \(V\), depending on the specific application.